rm(list=ls())
<- read.csv("https://www.dropbox.com/scl/fi/oxblp3vmk80s2203a91hf/categorical_data.csv?rlkey=50241qnzexvvmxgkj6gae5mlm&dl=1") cat_data
7 Dealing with Categorical Data - Practical
7.1 Introduction
Following on from the pre-class reading, this tutorial introduces key concepts and techniques for working with categorical data in sport data analytics.
We’ll cover:
- encoding methods
- handling missing data
- visualising distributions
- building predictive models
7.2 Load Dataset
First, load the dataset as a dataframe called [cat_data]:
7.3 Defining Categorical Data
What is categorical data?
Categorical data are variables that fall into distinct groups or categories, such as “Sport”, “Gender”, and “League”.
They can either be nominal, with no inherent order, or ordinal, where the categories have a meaningful order.
Demonstration
Steps
- Examine the structure of the data.
- Identify categorical variables.
Code
# Check the structure of the dataset
str(cat_data)
'data.frame': 500 obs. of 8 variables:
$ PlayerID : int 1 2 3 4 5 6 7 8 9 10 ...
$ Sport : chr "Football" "Tennis" "Basketball" "Tennis" ...
$ Gender : chr "Female" "Female" "Female" "Female" ...
$ ExperienceLevel : chr "Intermediate" "Advanced" "Intermediate" "Beginner" ...
$ PerformanceScore: int 97 99 73 62 85 78 79 72 71 80 ...
$ AttendanceRate : chr "Moderate" "Moderate" "Moderate" "Low" ...
$ Region : chr "North" "West" "West" "East" ...
$ Outcome : chr "Win" "Lose" "Win" "Win" ...
# Identify categorical variables
<- sapply(cat_data, is.character) | sapply(cat_data, is.factor)
categorical_vars cat("Categorical Variables:", names(cat_data)[categorical_vars])
Categorical Variables: Sport Gender ExperienceLevel AttendanceRate Region Outcome
Exercise
- Use
str()
to review the structure of the dataset. - Review which variables are categorical, specifying whether they are nominal or ordinal.
7.4 Encoding Categorical Variables
Why encode categorical variables?
Machine learning algorithms often require numeric inputs, meaning we need to convert categorical data into numerical forms.
Encoding methods such as one-hot encoding and label encoding transform categories into numeric values that models can process.
Demonstration
Steps
- Review one-hot encoding.
- Apply one-hot encoding to [Sport] and [Gender].
- Review the new dataset structure.
Code
library(caret)
Loading required package: ggplot2
Loading required package: lattice
# One-hot encoding
<- dummyVars("~ Sport + Gender", data = cat_data)
dummy_vars <- predict(dummy_vars, newdata = cat_data)
encoded_data
# Combine encoded data with the original dataset
<- cbind(cat_data, encoded_data)
cat_data_encoded head(cat_data_encoded)
PlayerID Sport Gender ExperienceLevel PerformanceScore AttendanceRate
1 1 Football Female Intermediate 97 Moderate
2 2 Tennis Female Advanced 99 Moderate
3 3 Basketball Female Intermediate 73 Moderate
4 4 Tennis Female Beginner 62 Low
5 5 Hockey Female Beginner 85 Moderate
6 6 Football Female Advanced 78 Low
Region Outcome SportBasketball SportFootball SportHockey SportTennis
1 North Win 0 1 0 0
2 West Lose 0 0 0 1
3 West Win 1 0 0 0
4 East Win 0 0 0 1
5 West Win 0 0 1 0
6 East Win 0 1 0 0
GenderFemale GenderMale
1 1 0
2 1 0
3 1 0
4 1 0
5 1 0
6 1 0
Exercise
- Perform one-hot encoding for the variable [AttendanceRate].
- Review: what are the benefits and drawbacks of one-hot encoding?
Show solution
# Perform one-hot encoding on AttendanceRate
library(caret)
<- dummyVars("~ AttendanceRate", data = cat_data)
dummy_vars <- predict(dummy_vars, newdata = cat_data)
encoded_data
# Add encoded columns to the dataset
<- cbind(cat_data, encoded_data)
cat_data_encoded head(cat_data_encoded)
PlayerID Sport Gender ExperienceLevel PerformanceScore AttendanceRate
1 1 Football Female Intermediate 97 Moderate
2 2 Tennis Female Advanced 99 Moderate
3 3 Basketball Female Intermediate 73 Moderate
4 4 Tennis Female Beginner 62 Low
5 5 Hockey Female Beginner 85 Moderate
6 6 Football Female Advanced 78 Low
Region Outcome AttendanceRateHigh AttendanceRateLow AttendanceRateModerate
1 North Win 0 0 1
2 West Lose 0 0 1
3 West Win 0 0 1
4 East Win 0 1 0
5 West Win 0 0 1
6 East Win 0 1 0
7.5 Handling Missing Categorical Data
Why do we need to handle missing data?
Like other forms of data, missing values in categorical variables can distort analyses and bias our models.
Common methods to address this include mode imputation (filling in missing values with the most frequent category) or creating a new “Missing” category.
Remember, since categorical data has no numerical meaning, we can’t use the median or mean for imputation.
Demonstration
Steps
- Check for missing values.
- Apply mode imputation to [AttendanceRate].
- Add a “Missing” category to [ExperienceLevel].
Code
# Check for missing values
colSums(is.na(cat_data))
PlayerID Sport Gender ExperienceLevel
0 0 0 30
PerformanceScore AttendanceRate Region Outcome
0 50 0 0
# Mode imputation for AttendanceRate
$AttendanceRate[is.na(cat_data$AttendanceRate)] <-
cat_datanames(sort(table(cat_data$AttendanceRate), decreasing = TRUE))[1]
# Add a "Missing" category for ExperienceLevel
$ExperienceLevel[is.na(cat_data$ExperienceLevel)] <- "Missing"
cat_datatable(cat_data$ExperienceLevel)
Advanced Beginner Intermediate Missing
141 144 185 30
Exercise
- Handle missing data in [Region] by creating a “Missing” category.
- Compare the proportion of missing values before and after handling.
Show solution
<- sum(is.na(cat_data$Region))
before
# Handle missing values in Region by adding a "Missing" category
$Region[is.na(cat_data$Region)] <- "Missing"
cat_data
# Compare proportions before and after handling
<- sum(is.na(cat_data$Region))
after
cat("Missing values before handling:", before)
Missing values before handling: 0
Show solution
cat("Missing values after handling:", after)
Missing values after handling: 0
7.6 Exploratory Data Analysis (EDA) for Categorical Data
Why perform EDA?
EDA helps understand patterns, distributions, and relationships in the data.
In the context of categorical data, frequency tables and cross-tabulations are useful tools for understanding the prevalence of categories and their associations with other variables.
Demonstration
Steps
- Create frequency tables for categorical variables.
- Perform cross-tabulation to analyse relationships.
Code
# Frequency table for Sport
table(cat_data$Sport)
Basketball Football Hockey Tennis
148 209 48 95
# Cross-tabulation of Sport and Outcome
table(cat_data$Sport, cat_data$Outcome)
Lose Win
Basketball 14 134
Football 10 199
Hockey 13 35
Tennis 17 78
Exercise
- Generate a frequency table for [Region].
- Analyse the relationship between [Gender] and [Outcome] using a cross-tabulation.
Show solution
# Frequency table for Region
<- table(cat_data$Region)
freq_region freq_region
East North South West
123 127 125 125
Show solution
# Cross-tabulation of Gender and Outcome
<- table(cat_data$Gender, cat_data$Outcome)
cross_tab cross_tab
Lose Win
Female 23 219
Male 31 227
7.7 Visualising Categorical Data
Why visualise categorical data?
Just like with continuous data, visualisations reveal insights into distributions and relationships within categorical data.
For example, bar charts and stacked bar charts can portray the distribution and relationships of categorical variables.
Demonstration
Steps
- Create a bar chart for [Sport].
- Add a stacked bar chart for [Sport] and [Outcome].
Code
library(ggplot2)
# Bar chart for Sport
ggplot(cat_data, aes(x = Sport)) +
geom_bar(fill = "skyblue") +
theme_minimal()
# Stacked bar chart
ggplot(cat_data, aes(x = Sport, fill = Outcome)) +
geom_bar(position = "fill") +
theme_minimal()
Exercise
- Create a bar chart for [AttendanceRate].
- Design a stacked bar chart for [Gender] and [Outcome].
#| code-fold: true
#| code-summary: Show solution
library(ggplot2)
# Bar chart for AttendanceRate
ggplot(cat_data, aes(x = AttendanceRate)) +
geom_bar(fill = "blue") +
theme_minimal()
# Stacked bar chart for Gender and Outcome
ggplot(cat_data, aes(x = Gender, fill = Outcome)) +
geom_bar(position = "fill") +
theme_minimal()
7.8 Statistical Methods for Categorical Data
Introduction
Statistical techniques such as logistic regression, Chi-square tests, and measures of association are designed to analyse relationships involving categorical variables:
- Logistic regression predicts binary outcomes (an outcome with only two levels). It performs a similar function to correlation.
- Chi-square tests assess the independence of categorical variables. It performs a similar function to tests of difference such as t-tests and ANOVA.
Demonstration
Steps
- Perform a Logistic Regression on [Outcome]: Predict categorical outcomes (lose or win) based on other variables.
- Perform a Chi-square Test: Assess relationships between [Sport] and [Outcome].
At the moment, we can’t run a logistic regression because the [Outcome] variable is not encoded as binary values (0 and 1), which are required for logistic regression. Logistic regression in R expects the dependent variable to be either numeric (0 and 1) or a factor with exactly two levels.
Code
# First, I need to convert [Outcome] to a factor with levels "Lose" (0) and "Win" (1)
$Outcome <- factor(cat_data$Outcome, levels = c("Lose", "Win"))
cat_data
# Logistic regression
<- glm(Outcome ~ Sport + Gender + AttendanceRate,
log_model data = cat_data, family = "binomial")
summary(log_model)
Call:
glm(formula = Outcome ~ Sport + Gender + AttendanceRate, family = "binomial",
data = cat_data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.1516 0.4418 4.870 1.12e-06 ***
SportFootball 0.7165 0.4299 1.667 0.09556 .
SportHockey -1.2651 0.4311 -2.935 0.00334 **
SportTennis -0.7262 0.3895 -1.864 0.06227 .
GenderMale -0.2263 0.3002 -0.754 0.45102
AttendanceRateLow 0.5155 0.4561 1.130 0.25846
AttendanceRateModerate 0.1936 0.3766 0.514 0.60716
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 342.31 on 499 degrees of freedom
Residual deviance: 316.44 on 493 degrees of freedom
AIC: 330.44
Number of Fisher Scoring iterations: 5
# Chi-square test
chisq.test(table(cat_data$Sport, cat_data$Outcome))
Pearson's Chi-squared test
data: table(cat_data$Sport, cat_data$Outcome)
X-squared = 26.301, df = 3, p-value = 8.249e-06
Exercise
- Run logistic regression using [Region] as a predictor and [Outcome] as the outcome variable.
- Perform a Chi-square test to evaluate differences for [Gender] on the outcome [AttendanceRate].
Show solution
# Logistic regression for Region and Outcome
<- glm(Outcome ~ Region, data = cat_data, family = "binomial")
log_model summary(log_model)
Call:
glm(formula = Outcome ~ Region, family = "binomial", data = cat_data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.3206 0.3160 7.344 2.07e-13 ***
RegionNorth -0.5196 0.4057 -1.281 0.200
RegionSouth 0.0177 0.4467 0.040 0.968
RegionWest -0.2501 0.4246 -0.589 0.556
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 342.31 on 499 degrees of freedom
Residual deviance: 339.90 on 496 degrees of freedom
AIC: 347.9
Number of Fisher Scoring iterations: 5
Show solution
# Chi-square test for Gender and AttendanceRate
<- chisq.test(table(cat_data$Gender, cat_data$AttendanceRate))
chisq_test chisq_test
Pearson's Chi-squared test
data: table(cat_data$Gender, cat_data$AttendanceRate)
X-squared = 0.28328, df = 2, p-value = 0.8679
7.9 Dealing with High Cardinality
Why manage high cardinality?
High cardinality occurs when categorical variables have many unique values.
Strategies like grouping infrequent categories can simplify variables, reduce noise, and improve our model interpretability without sacrificing significant information.
Demonstration
Steps
- Combine infrequent categories in [Region].
Code
$Region <- as.character(cat_data$Region)
cat_data$Region[cat_data$Region %in% c("North", "South")] <- "North-South"
cat_datatable(cat_data$Region)
East North-South West
123 252 125
Exercise
- Combine rare categories in the [Sport] variable.
- Review: how might high cardinality impact our modeling.
Show solution
# Combine rare categories in Sport
$Sport <- as.character(cat_data$Sport)
cat_data$Sport[cat_data$Sport == "Hockey"] <- "Other"
cat_datatable(cat_data$Sport)
Basketball Football Other Tennis
148 209 48 95
7.10 Interaction Effects
Introduction
Interaction effects occur when the impact of one categorical variable depends on the level of another.
Including interaction terms in models can capture these relationships, enhancing predictive performance and providing insight into the interplay between variables.
Demonstration
Steps
- Explore interactions between categorical variables [Sport] and [Gender]
Code
# Interaction variable
$Sport_Gender <- interaction(cat_data$Sport, cat_data$Gender)
cat_datatable(cat_data$Sport_Gender)
Basketball.Female Football.Female Other.Female Tennis.Female
66 109 21 46
Basketball.Male Football.Male Other.Male Tennis.Male
82 100 27 49
Exercise
- Create an interaction variable for [Region] and [AttendanceRate].
- Analyse the new variable’s distribution.
Show solution
# Create an interaction variable for Region and AttendanceRate
$Region_Attendance <- interaction(cat_data$Region, cat_data$AttendanceRate)
cat_datatable(cat_data$Region_Attendance)
East.High North-South.High West.High
16 45 25
East.Low North-South.Low West.Low
36 70 27
East.Moderate North-South.Moderate West.Moderate
71 137 73
7.11 Sparse Data
What is “sparse data”?
Sparse data refers to categories with very few observations, which can lead to instability in models and analyses.
We can combine sparse categories into broader groups or apply “regularisation” techniques can mitigate these challenges.
Demonstration
Steps
- Identify and resolve sparse categories.
Code
# Combine sparse categories in Gender
$Gender[cat_data$Gender == "Non-binary"] <- "Other"
cat_datatable(cat_data$Gender)
Female Male
242 258
Exercise
- Identify sparse categories in [AttendanceRate] and resolve them.
Show solution
# Identify sparse categories in AttendanceRate
table(cat_data$AttendanceRate)
High Low Moderate
86 133 281
Show solution
# Combine sparse categories
$AttendanceRate <- as.character(cat_data$AttendanceRate)
cat_data$AttendanceRate[cat_data$AttendanceRate == "Low"] <- "Other"
cat_datatable(cat_data$AttendanceRate)
High Moderate Other
86 281 133
7.12 Predictive Models Using Categorical Data Only
Introduction
Predictive models can use only categorical variables to predict outcomes.
Demonstration
Code
# Logistic regression with categorical variables
<- glm(Outcome ~ Sport + Gender + Region, data = cat_data, family = "binomial")
cat_model summary(cat_model)
Call:
glm(formula = Outcome ~ Sport + Gender + Region, family = "binomial",
data = cat_data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.5212 0.4385 5.750 8.94e-09 ***
SportFootball 0.7120 0.4297 1.657 0.09751 .
SportOther -1.2544 0.4313 -2.908 0.00364 **
SportTennis -0.7477 0.3887 -1.924 0.05440 .
GenderMale -0.2116 0.2994 -0.707 0.47969
RegionNorth-South -0.1971 0.3826 -0.515 0.60642
RegionWest -0.1515 0.4374 -0.346 0.72917
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 342.31 on 499 degrees of freedom
Residual deviance: 317.53 on 493 degrees of freedom
AIC: 331.53
Number of Fisher Scoring iterations: 5
7.13 Predictive Models Using Mixed Data
Introduction
We can also combine categorical and continuous variables in predictive models to enhance their explanatory and predictive power.
Demonstration
# Logistic regression with mixed variables
<- glm(Outcome ~ Sport + Gender + PerformanceScore,
mixed_model data = cat_data, family = "binomial")
summary(mixed_model)
Call:
glm(formula = Outcome ~ Sport + Gender + PerformanceScore, family = "binomial",
data = cat_data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.66488 0.84272 0.789 0.43013
SportFootball 0.73323 0.43097 1.701 0.08888 .
SportOther -1.35516 0.43596 -3.108 0.00188 **
SportTennis -0.71933 0.39130 -1.838 0.06602 .
GenderMale -0.14581 0.30287 -0.481 0.63022
PerformanceScore 0.02290 0.01057 2.166 0.03030 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 342.31 on 499 degrees of freedom
Residual deviance: 312.97 on 494 degrees of freedom
AIC: 324.97
Number of Fisher Scoring iterations: 5